{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# Getting Started with Microsoft's Presidio: A Step-by-Step Guide to Detecting and Anonymizing PII in Text\n",
        "In this tutorial, we will explore how to use Microsoft's Presidio, an open-source framework for detecting, analyzing, and anonymizing personally identifiable information (PII) in free-form text. You’ll learn how to:\n",
        "\n",
        "* Set up and install the necessary Presidio packages\n",
        "\n",
        "* Detect common PII entities such as names, phone numbers, and credit card details\n",
        "\n",
        "* Define custom recognizers for domain-specific entities (e.g., PAN, Aadhaar)\n",
        "\n",
        "* Create and register custom anonymizers (like hashing or pseudonymization)\n",
        "\n",
        "* Reuse anonymization mappings for consistent re-anonymization\n",
        "\n",
        "\n"
      ],
      "metadata": {
        "id": "RU1FSD88SNTY"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 1. Installing the libraries"
      ],
      "metadata": {
        "id": "g0_oOOz-Stp_"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "To get started with Presidio, you'll need to install the following key libraries:\n",
        "\n",
        "* **presidio-analyzer**: This is the core library responsible for detecting PII entities in text using built-in and custom recognizers.\n",
        "\n",
        "* **presidio-anonymizer:** This library provides tools to anonymize (e.g., redact, replace, hash) the detected PII using configurable operators.\n",
        "\n",
        "* **spaCy NLP model (en_core_web_lg):** Presidio uses spaCy under the hood for natural language processing tasks like named entity recognition. The en_core_web_lg model provides high-accuracy results and is recommended for English-language PII detection."
      ],
      "metadata": {
        "id": "lQYu62CiS2pM"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "collapsed": true,
        "id": "ALTWsxgjWGT6",
        "outputId": "3e220be7-4ecd-49e5-85be-897c2a408c04"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Collecting presidio-analyzer\n",
            "  Downloading presidio_analyzer-2.2.358-py3-none-any.whl.metadata (3.2 kB)\n",
            "Collecting phonenumbers<9.0.0,>=8.12 (from presidio-analyzer)\n",
            "  Downloading phonenumbers-8.13.55-py2.py3-none-any.whl.metadata (11 kB)\n",
            "Requirement already satisfied: pyyaml in /usr/local/lib/python3.11/dist-packages (from presidio-analyzer) (6.0.2)\n",
            "Requirement already satisfied: regex in /usr/local/lib/python3.11/dist-packages (from presidio-analyzer) (2024.11.6)\n",
            "Requirement already satisfied: spacy!=3.7.0,<4.0.0,>=3.4.4 in /usr/local/lib/python3.11/dist-packages (from presidio-analyzer) (3.8.7)\n",
            "Collecting tldextract (from presidio-analyzer)\n",
            "  Downloading tldextract-5.3.0-py3-none-any.whl.metadata (11 kB)\n",
            "Requirement already satisfied: spacy-legacy<3.1.0,>=3.0.11 in /usr/local/lib/python3.11/dist-packages (from spacy!=3.7.0,<4.0.0,>=3.4.4->presidio-analyzer) (3.0.12)\n",
            "Requirement already satisfied: spacy-loggers<2.0.0,>=1.0.0 in /usr/local/lib/python3.11/dist-packages (from spacy!=3.7.0,<4.0.0,>=3.4.4->presidio-analyzer) (1.0.5)\n",
            "Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /usr/local/lib/python3.11/dist-packages (from spacy!=3.7.0,<4.0.0,>=3.4.4->presidio-analyzer) (1.0.13)\n",
            "Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /usr/local/lib/python3.11/dist-packages (from spacy!=3.7.0,<4.0.0,>=3.4.4->presidio-analyzer) (2.0.11)\n",
            "Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /usr/local/lib/python3.11/dist-packages (from spacy!=3.7.0,<4.0.0,>=3.4.4->presidio-analyzer) (3.0.10)\n",
            "Requirement already satisfied: thinc<8.4.0,>=8.3.4 in /usr/local/lib/python3.11/dist-packages (from spacy!=3.7.0,<4.0.0,>=3.4.4->presidio-analyzer) (8.3.6)\n",
            "Requirement already satisfied: wasabi<1.2.0,>=0.9.1 in /usr/local/lib/python3.11/dist-packages (from spacy!=3.7.0,<4.0.0,>=3.4.4->presidio-analyzer) (1.1.3)\n",
            "Requirement already satisfied: srsly<3.0.0,>=2.4.3 in /usr/local/lib/python3.11/dist-packages (from spacy!=3.7.0,<4.0.0,>=3.4.4->presidio-analyzer) (2.5.1)\n",
            "Requirement already satisfied: catalogue<2.1.0,>=2.0.6 in /usr/local/lib/python3.11/dist-packages (from spacy!=3.7.0,<4.0.0,>=3.4.4->presidio-analyzer) (2.0.10)\n",
            "Requirement already satisfied: weasel<0.5.0,>=0.1.0 in /usr/local/lib/python3.11/dist-packages (from spacy!=3.7.0,<4.0.0,>=3.4.4->presidio-analyzer) (0.4.1)\n",
            "Requirement already satisfied: typer<1.0.0,>=0.3.0 in /usr/local/lib/python3.11/dist-packages (from spacy!=3.7.0,<4.0.0,>=3.4.4->presidio-analyzer) (0.16.0)\n",
            "Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in /usr/local/lib/python3.11/dist-packages (from spacy!=3.7.0,<4.0.0,>=3.4.4->presidio-analyzer) (4.67.1)\n",
            "Requirement already satisfied: numpy>=1.19.0 in /usr/local/lib/python3.11/dist-packages (from spacy!=3.7.0,<4.0.0,>=3.4.4->presidio-analyzer) (2.0.2)\n",
            "Requirement already satisfied: requests<3.0.0,>=2.13.0 in /usr/local/lib/python3.11/dist-packages (from spacy!=3.7.0,<4.0.0,>=3.4.4->presidio-analyzer) (2.32.3)\n",
            "Requirement already satisfied: pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4 in /usr/local/lib/python3.11/dist-packages (from spacy!=3.7.0,<4.0.0,>=3.4.4->presidio-analyzer) (2.11.7)\n",
            "Requirement already satisfied: jinja2 in /usr/local/lib/python3.11/dist-packages (from spacy!=3.7.0,<4.0.0,>=3.4.4->presidio-analyzer) (3.1.6)\n",
            "Requirement already satisfied: setuptools in /usr/local/lib/python3.11/dist-packages (from spacy!=3.7.0,<4.0.0,>=3.4.4->presidio-analyzer) (75.2.0)\n",
            "Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.11/dist-packages (from spacy!=3.7.0,<4.0.0,>=3.4.4->presidio-analyzer) (24.2)\n",
            "Requirement already satisfied: langcodes<4.0.0,>=3.2.0 in /usr/local/lib/python3.11/dist-packages (from spacy!=3.7.0,<4.0.0,>=3.4.4->presidio-analyzer) (3.5.0)\n",
            "Requirement already satisfied: idna in /usr/local/lib/python3.11/dist-packages (from tldextract->presidio-analyzer) (3.10)\n",
            "Collecting requests-file>=1.4 (from tldextract->presidio-analyzer)\n",
            "  Downloading requests_file-2.1.0-py2.py3-none-any.whl.metadata (1.7 kB)\n",
            "Requirement already satisfied: filelock>=3.0.8 in /usr/local/lib/python3.11/dist-packages (from tldextract->presidio-analyzer) (3.18.0)\n",
            "Requirement already satisfied: language-data>=1.2 in /usr/local/lib/python3.11/dist-packages (from langcodes<4.0.0,>=3.2.0->spacy!=3.7.0,<4.0.0,>=3.4.4->presidio-analyzer) (1.3.0)\n",
            "Requirement already satisfied: annotated-types>=0.6.0 in /usr/local/lib/python3.11/dist-packages (from pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4->spacy!=3.7.0,<4.0.0,>=3.4.4->presidio-analyzer) (0.7.0)\n",
            "Requirement already satisfied: pydantic-core==2.33.2 in /usr/local/lib/python3.11/dist-packages (from pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4->spacy!=3.7.0,<4.0.0,>=3.4.4->presidio-analyzer) (2.33.2)\n",
            "Requirement already satisfied: typing-extensions>=4.12.2 in /usr/local/lib/python3.11/dist-packages (from pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4->spacy!=3.7.0,<4.0.0,>=3.4.4->presidio-analyzer) (4.14.0)\n",
            "Requirement already satisfied: typing-inspection>=0.4.0 in /usr/local/lib/python3.11/dist-packages (from pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4->spacy!=3.7.0,<4.0.0,>=3.4.4->presidio-analyzer) (0.4.1)\n",
            "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.11/dist-packages (from requests<3.0.0,>=2.13.0->spacy!=3.7.0,<4.0.0,>=3.4.4->presidio-analyzer) (3.4.2)\n",
            "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.11/dist-packages (from requests<3.0.0,>=2.13.0->spacy!=3.7.0,<4.0.0,>=3.4.4->presidio-analyzer) (2.4.0)\n",
            "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.11/dist-packages (from requests<3.0.0,>=2.13.0->spacy!=3.7.0,<4.0.0,>=3.4.4->presidio-analyzer) (2025.6.15)\n",
            "Requirement already satisfied: blis<1.4.0,>=1.3.0 in /usr/local/lib/python3.11/dist-packages (from thinc<8.4.0,>=8.3.4->spacy!=3.7.0,<4.0.0,>=3.4.4->presidio-analyzer) (1.3.0)\n",
            "Requirement already satisfied: confection<1.0.0,>=0.0.1 in /usr/local/lib/python3.11/dist-packages (from thinc<8.4.0,>=8.3.4->spacy!=3.7.0,<4.0.0,>=3.4.4->presidio-analyzer) (0.1.5)\n",
            "Requirement already satisfied: click>=8.0.0 in /usr/local/lib/python3.11/dist-packages (from typer<1.0.0,>=0.3.0->spacy!=3.7.0,<4.0.0,>=3.4.4->presidio-analyzer) (8.2.1)\n",
            "Requirement already satisfied: shellingham>=1.3.0 in /usr/local/lib/python3.11/dist-packages (from typer<1.0.0,>=0.3.0->spacy!=3.7.0,<4.0.0,>=3.4.4->presidio-analyzer) (1.5.4)\n",
            "Requirement already satisfied: rich>=10.11.0 in /usr/local/lib/python3.11/dist-packages (from typer<1.0.0,>=0.3.0->spacy!=3.7.0,<4.0.0,>=3.4.4->presidio-analyzer) (13.9.4)\n",
            "Requirement already satisfied: cloudpathlib<1.0.0,>=0.7.0 in /usr/local/lib/python3.11/dist-packages (from weasel<0.5.0,>=0.1.0->spacy!=3.7.0,<4.0.0,>=3.4.4->presidio-analyzer) (0.21.1)\n",
            "Requirement already satisfied: smart-open<8.0.0,>=5.2.1 in /usr/local/lib/python3.11/dist-packages (from weasel<0.5.0,>=0.1.0->spacy!=3.7.0,<4.0.0,>=3.4.4->presidio-analyzer) (7.1.0)\n",
            "Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.11/dist-packages (from jinja2->spacy!=3.7.0,<4.0.0,>=3.4.4->presidio-analyzer) (3.0.2)\n",
            "Requirement already satisfied: marisa-trie>=1.1.0 in /usr/local/lib/python3.11/dist-packages (from language-data>=1.2->langcodes<4.0.0,>=3.2.0->spacy!=3.7.0,<4.0.0,>=3.4.4->presidio-analyzer) (1.2.1)\n",
            "Requirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/lib/python3.11/dist-packages (from rich>=10.11.0->typer<1.0.0,>=0.3.0->spacy!=3.7.0,<4.0.0,>=3.4.4->presidio-analyzer) (3.0.0)\n",
            "Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.11/dist-packages (from rich>=10.11.0->typer<1.0.0,>=0.3.0->spacy!=3.7.0,<4.0.0,>=3.4.4->presidio-analyzer) (2.19.1)\n",
            "Requirement already satisfied: wrapt in /usr/local/lib/python3.11/dist-packages (from smart-open<8.0.0,>=5.2.1->weasel<0.5.0,>=0.1.0->spacy!=3.7.0,<4.0.0,>=3.4.4->presidio-analyzer) (1.17.2)\n",
            "Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.11/dist-packages (from markdown-it-py>=2.2.0->rich>=10.11.0->typer<1.0.0,>=0.3.0->spacy!=3.7.0,<4.0.0,>=3.4.4->presidio-analyzer) (0.1.2)\n",
            "Downloading presidio_analyzer-2.2.358-py3-none-any.whl (114 kB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m114.9/114.9 kB\u001b[0m \u001b[31m2.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading phonenumbers-8.13.55-py2.py3-none-any.whl (2.6 MB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.6/2.6 MB\u001b[0m \u001b[31m31.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading tldextract-5.3.0-py3-none-any.whl (107 kB)\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m107.4/107.4 kB\u001b[0m \u001b[31m6.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hDownloading requests_file-2.1.0-py2.py3-none-any.whl (4.2 kB)\n",
            "Installing collected packages: phonenumbers, requests-file, tldextract, presidio-analyzer\n",
            "Successfully installed phonenumbers-8.13.55 presidio-analyzer-2.2.358 requests-file-2.1.0 tldextract-5.3.0\n",
            "Collecting presidio-anonymizer\n",
            "  Downloading presidio_anonymizer-2.2.358-py3-none-any.whl.metadata (8.1 kB)\n",
            "Requirement already satisfied: cryptography<44.1 in /usr/local/lib/python3.11/dist-packages (from presidio-anonymizer) (43.0.3)\n",
            "Requirement already satisfied: cffi>=1.12 in /usr/local/lib/python3.11/dist-packages (from cryptography<44.1->presidio-anonymizer) (1.17.1)\n",
            "Requirement already satisfied: pycparser in /usr/local/lib/python3.11/dist-packages (from cffi>=1.12->cryptography<44.1->presidio-anonymizer) (2.22)\n",
            "Downloading presidio_anonymizer-2.2.358-py3-none-any.whl (31 kB)\n",
            "Installing collected packages: presidio-anonymizer\n",
            "Successfully installed presidio-anonymizer-2.2.358\n",
            "Collecting en-core-web-lg==3.8.0\n",
            "  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.8.0/en_core_web_lg-3.8.0-py3-none-any.whl (400.7 MB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m400.7/400.7 MB\u001b[0m \u001b[31m2.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hInstalling collected packages: en-core-web-lg\n",
            "Successfully installed en-core-web-lg-3.8.0\n",
            "\u001b[38;5;2m✔ Download and installation successful\u001b[0m\n",
            "You can now load the package via spacy.load('en_core_web_lg')\n",
            "\u001b[38;5;3m⚠ Restart to reload dependencies\u001b[0m\n",
            "If you are in a Jupyter or Colab notebook, you may need to restart Python in\n",
            "order to load all the package's dependencies. You can do this by selecting the\n",
            "'Restart kernel' or 'Restart runtime' option.\n"
          ]
        }
      ],
      "source": [
        "!pip install presidio-analyzer\n",
        "!pip install presidio-anonymizer\n",
        "!python -m spacy download en_core_web_lg"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "You might need to restart the session to install the libraries."
      ],
      "metadata": {
        "id": "iQ1uRtkIS920"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# 2. Presidio Analyzer"
      ],
      "metadata": {
        "id": "lO83He_9TDDW"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 2.1 Basic PII Detection\n",
        "In this block, we initialize the Presidio Analyzer Engine and run a basic analysis to detect a U.S. phone number from a sample text. We also suppress lower-level log warnings from the Presidio library for cleaner output.\n",
        "\n",
        "The AnalyzerEngine loads spaCy’s NLP pipeline and predefined recognizers to scan the input text for sensitive entities. In this example, we specify PHONE_NUMBER as the target entity."
      ],
      "metadata": {
        "id": "3V6mXAmYTHsd"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import logging\n",
        "logging.getLogger(\"presidio-analyzer\").setLevel(logging.ERROR)\n",
        "\n",
        "from presidio_analyzer import AnalyzerEngine\n",
        "\n",
        "# Set up the engine, loads the NLP module (spaCy model by default) and other PII recognizers\n",
        "analyzer = AnalyzerEngine()\n",
        "\n",
        "# Call analyzer to get results\n",
        "results = analyzer.analyze(text=\"My phone number is 212-555-5555\",\n",
        "                           entities=[\"PHONE_NUMBER\"],\n",
        "                           language='en')\n",
        "print(results)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "tygT4nYnWmf-",
        "outputId": "0f84ed30-892b-4801-aa38-355fe509863d"
      },
      "execution_count": 2,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "[type: PHONE_NUMBER, start: 19, end: 31, score: 0.75]\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 2.2 Creating a Custom PII Recognizer with a Deny List (Academic Titles)\n",
        "This code block shows how to create a custom PII recognizer in Presidio using a simple deny list, ideal for detecting fixed terms like academic titles (e.g., \"Dr.\", \"Prof.\"). The recognizer is added to Presidio’s registry and used by the analyzer to scan input text.\n",
        "\n",
        "While this tutorial covers only the deny list approach, Presidio also supports regex-based patterns, NLP models, and external recognizers. For those advanced methods, refer to the official docs: [Adding Custom Recognizers.](https://microsoft.github.io/presidio/analyzer/adding_recognizers/)"
      ],
      "metadata": {
        "id": "o0GxUhmGTfsu"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from presidio_analyzer import AnalyzerEngine, PatternRecognizer, RecognizerRegistry\n",
        "\n",
        "# Step 1: Create a custom pattern recognizer using deny_list\n",
        "academic_title_recognizer = PatternRecognizer(\n",
        "    supported_entity=\"ACADEMIC_TITLE\",\n",
        "    deny_list=[\"Dr.\", \"Dr\", \"Professor\", \"Prof.\"]\n",
        ")\n",
        "\n",
        "# Step 2: Add it to a registry\n",
        "registry = RecognizerRegistry()\n",
        "registry.load_predefined_recognizers()\n",
        "registry.add_recognizer(academic_title_recognizer)\n",
        "\n",
        "# Step 3: Create analyzer engine with the updated registry\n",
        "analyzer = AnalyzerEngine(registry=registry)\n",
        "\n",
        "# Step 4: Analyze text\n",
        "text = \"Prof. John Smith is meeting with Dr. Alice Brown.\"\n",
        "results = analyzer.analyze(text=text, language=\"en\")\n",
        "\n",
        "for result in results:\n",
        "    print(result)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "1yXTMLKpW9yt",
        "outputId": "b6209a8e-fd1c-4c71-dca0-5e12c229211b"
      },
      "execution_count": 3,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "type: ACADEMIC_TITLE, start: 0, end: 5, score: 1.0\n",
            "type: ACADEMIC_TITLE, start: 33, end: 36, score: 1.0\n",
            "type: PERSON, start: 6, end: 16, score: 0.85\n",
            "type: PERSON, start: 37, end: 48, score: 0.85\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# 3. Presidio Anonymizer"
      ],
      "metadata": {
        "id": "XjzwPCluT8KU"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 3.1 Anonymizing PII Using Presidio Anonymizer\n",
        "This code block demonstrates how to use the Presidio Anonymizer Engine to anonymize detected PII entities in a given text. In this example, we manually define two PERSON entities using RecognizerResult, simulating output from the Presidio Analyzer. These entities represent the names \"Bond\" and \"James Bond\" in the sample text.\n",
        "\n",
        "We use the \"replace\" operator to substitute both names with a placeholder value (\"BIP\"), effectively anonymizing the sensitive data. This is done by passing an OperatorConfig with the desired anonymization strategy (replace) to the AnonymizerEngine.\n",
        "\n",
        "This pattern can be easily extended to apply other built-in operations like \"redact\", \"hash\", or custom pseudonymization strategies."
      ],
      "metadata": {
        "id": "xOl-u946UGDy"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from presidio_anonymizer import AnonymizerEngine\n",
        "from presidio_anonymizer.entities import RecognizerResult, OperatorConfig\n",
        "\n",
        "# Initialize the engine:\n",
        "engine = AnonymizerEngine()\n",
        "\n",
        "# Invoke the anonymize function with the text,\n",
        "# analyzer results (potentially coming from presidio-analyzer) and\n",
        "# Operators to get the anonymization output:\n",
        "result = engine.anonymize(\n",
        "    text=\"My name is Bond, James Bond\",\n",
        "    analyzer_results=[\n",
        "        RecognizerResult(entity_type=\"PERSON\", start=11, end=15, score=0.8),\n",
        "        RecognizerResult(entity_type=\"PERSON\", start=17, end=27, score=0.8),\n",
        "    ],\n",
        "    operators={\"PERSON\": OperatorConfig(\"replace\", {\"new_value\": \"BIP\"})},\n",
        ")\n",
        "\n",
        "print(result)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "CJ2Kyw1AQUYv",
        "outputId": "94f45c40-5a6b-49aa-f08a-22c1a80f934a"
      },
      "execution_count": 10,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "text: My name is BIP, BIP\n",
            "items:\n",
            "[\n",
            "    {'start': 16, 'end': 19, 'entity_type': 'PERSON', 'text': 'BIP', 'operator': 'replace'},\n",
            "    {'start': 11, 'end': 14, 'entity_type': 'PERSON', 'text': 'BIP', 'operator': 'replace'}\n",
            "]\n",
            "\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# 4. Custom Entity Recognition, Hash-Based Anonymization, and Consistent Re-Anonymization with Presidio\n",
        "In this example, we take Presidio a step further by demonstrating:\n",
        "\n",
        "* ✅ Defining custom PII entities (e.g., Aadhaar and PAN numbers) using regex-based PatternRecognizers\n",
        "\n",
        "* 🔐 Anonymizing sensitive data using a custom hash-based operator (ReAnonymizer)\n",
        "\n",
        "* ♻️ Re-anonymizing the same values consistently across multiple texts by maintaining a mapping of original → hashed values\n",
        "\n",
        "We implement a custom ReAnonymizer operator that checks if a given value has already been hashed and reuses the same output to preserve consistency. This is particularly useful when anonymized data needs to retain some utility — for example, linking records by pseudonymous IDs."
      ],
      "metadata": {
        "id": "_mbBxMakUhp_"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 4.1 Define a Custom Hash-Based Anonymizer (ReAnonymizer)\n",
        "This block defines a custom Operator called ReAnonymizer that uses SHA-256 hashing to anonymize entities and ensures the same input always gets the same anonymized output by storing hashes in a shared mapping."
      ],
      "metadata": {
        "id": "MDqZL1AgU3rb"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from presidio_anonymizer.operators import Operator, OperatorType\n",
        "import hashlib\n",
        "from typing import Dict\n",
        "\n",
        "class ReAnonymizer(Operator):\n",
        "    \"\"\"\n",
        "    Anonymizer that replaces text with a reusable SHA-256 hash,\n",
        "    stored in a shared mapping dict.\n",
        "    \"\"\"\n",
        "\n",
        "    def operate(self, text: str, params: Dict = None) -> str:\n",
        "        entity_type = params.get(\"entity_type\", \"DEFAULT\")\n",
        "        mapping = params.get(\"entity_mapping\")\n",
        "\n",
        "        if mapping is None:\n",
        "            raise ValueError(\"Missing `entity_mapping` in params\")\n",
        "\n",
        "        # Check if already hashed\n",
        "        if entity_type in mapping and text in mapping[entity_type]:\n",
        "            return mapping[entity_type][text]\n",
        "\n",
        "        # Hash and store\n",
        "        hashed = \"<HASH_\" + hashlib.sha256(text.encode()).hexdigest()[:10] + \">\"\n",
        "        mapping.setdefault(entity_type, {})[text] = hashed\n",
        "        return hashed\n",
        "\n",
        "    def validate(self, params: Dict = None) -> None:\n",
        "        if \"entity_mapping\" not in params:\n",
        "            raise ValueError(\"You must pass an 'entity_mapping' dictionary.\")\n",
        "\n",
        "    def operator_name(self) -> str:\n",
        "        return \"reanonymizer\"\n",
        "\n",
        "    def operator_type(self) -> OperatorType:\n",
        "        return OperatorType.Anonymize\n"
      ],
      "metadata": {
        "id": "e-KpAor1R2KX"
      },
      "execution_count": 17,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 4.2 Define Custom PII Recognizers for PAN and Aadhaar Numbers\n",
        "We define two custom regex-based PatternRecognizers — one for Indian PAN numbers and one for Aadhaar numbers. These will detect custom PII entities in your text."
      ],
      "metadata": {
        "id": "WkmFYIeeU8e4"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from presidio_analyzer import AnalyzerEngine, PatternRecognizer, Pattern\n",
        "\n",
        "# Define custom recognizers\n",
        "pan_recognizer = PatternRecognizer(\n",
        "    supported_entity=\"IND_PAN\",\n",
        "    name=\"PAN Recognizer\",\n",
        "    patterns=[Pattern(name=\"pan\", regex=r\"\\b[A-Z]{5}[0-9]{4}[A-Z]\\b\", score=0.8)],\n",
        "    supported_language=\"en\"\n",
        ")\n",
        "\n",
        "aadhaar_recognizer = PatternRecognizer(\n",
        "    supported_entity=\"AADHAAR\",\n",
        "    name=\"Aadhaar Recognizer\",\n",
        "    patterns=[Pattern(name=\"aadhaar\", regex=r\"\\b\\d{4}[- ]?\\d{4}[- ]?\\d{4}\\b\", score=0.8)],\n",
        "    supported_language=\"en\"\n",
        ")"
      ],
      "metadata": {
        "id": "rczUgSGhR4Fr"
      },
      "execution_count": 19,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 4.3 Set Up Analyzer and Anonymizer Engines\n",
        "Here we set up the Presidio AnalyzerEngine, register the custom recognizers, and add the custom anonymizer to the AnonymizerEngine."
      ],
      "metadata": {
        "id": "nt9eNE-8VIFM"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from presidio_anonymizer import AnonymizerEngine, OperatorConfig\n",
        "\n",
        "# Initialize analyzer and register custom recognizers\n",
        "analyzer = AnalyzerEngine()\n",
        "analyzer.registry.add_recognizer(pan_recognizer)\n",
        "analyzer.registry.add_recognizer(aadhaar_recognizer)\n",
        "\n",
        "# Initialize anonymizer and add custom operator\n",
        "anonymizer = AnonymizerEngine()\n",
        "anonymizer.add_anonymizer(ReAnonymizer)\n",
        "\n",
        "# Shared mapping dictionary for consistent re-anonymization\n",
        "entity_mapping = {}"
      ],
      "metadata": {
        "id": "PVrjhvsKVHQJ"
      },
      "execution_count": 20,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 4.4 Analyze and Anonymize Input Texts\n",
        "We analyze two separate texts that both include the same PAN and Aadhaar values. The custom operator ensures they’re anonymized consistently across both inputs."
      ],
      "metadata": {
        "id": "VRm9CyWaVQs1"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from pprint import pprint\n",
        "\n",
        "# Example texts\n",
        "text1 = \"My PAN is ABCDE1234F and Aadhaar number is 1234-5678-9123.\"\n",
        "text2 = \"His Aadhaar is 1234-5678-9123 and PAN is ABCDE1234F.\"\n",
        "\n",
        "# Analyze and anonymize first text\n",
        "results1 = analyzer.analyze(text=text1, language=\"en\")\n",
        "anon1 = anonymizer.anonymize(\n",
        "    text1,\n",
        "    results1,\n",
        "    {\n",
        "        \"DEFAULT\": OperatorConfig(\"reanonymizer\", {\"entity_mapping\": entity_mapping})\n",
        "    }\n",
        ")\n",
        "\n",
        "# Analyze and anonymize second text\n",
        "results2 = analyzer.analyze(text=text2, language=\"en\")\n",
        "anon2 = anonymizer.anonymize(\n",
        "    text2,\n",
        "    results2,\n",
        "    {\n",
        "        \"DEFAULT\": OperatorConfig(\"reanonymizer\", {\"entity_mapping\": entity_mapping})\n",
        "    }\n",
        ")"
      ],
      "metadata": {
        "id": "qFA5FNRnVOEH"
      },
      "execution_count": 21,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 4.5 View Anonymization Results and Mapping\n",
        "Finally, we print both anonymized outputs and inspect the mapping used internally to maintain consistent hashes across values."
      ],
      "metadata": {
        "id": "7h8MbPqzVWI8"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "print(\"📄 Original 1:\", text1)\n",
        "print(\"🔐 Anonymized 1:\", anon1.text)\n",
        "print(\"📄 Original 2:\", text2)\n",
        "print(\"🔐 Anonymized 2:\", anon2.text)\n",
        "\n",
        "print(\"\\n📦 Mapping used:\")\n",
        "pprint(entity_mapping)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "OQ5mW_J0VU3l",
        "outputId": "1a6ab206-1d83-4429-ae51-5a505de3622a"
      },
      "execution_count": 22,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "📄 Original 1: My PAN is ABCDE1234F and Aadhaar number is 1234-5678-9123.\n",
            "🔐 Anonymized 1: My PAN is <HASH_6442fd73a9> and Aadhaar number is <HASH_08e9d6b34c>.\n",
            "📄 Original 2: His Aadhaar is 1234-5678-9123 and PAN is ABCDE1234F.\n",
            "🔐 Anonymized 2: His Aadhaar is <HASH_08e9d6b34c> and PAN is <HASH_6442fd73a9>.\n",
            "\n",
            "📦 Mapping used:\n",
            "{'AADHAAR': {'1234-5678-9123': '<HASH_08e9d6b34c>'},\n",
            " 'DATE_TIME': {'1234-5678-9123': '<HASH_08e9d6b34c>'},\n",
            " 'IN_PAN': {'ABCDE1234F': '<HASH_6442fd73a9>'}}\n"
          ]
        }
      ]
    }
  ]
}